Efficiently Answering Probabilistic Threshold Top-k Queries on Uncertain Data (Extended Abstract)

Authors

  • Ming Hua
  • Jian Pei
  • Wenjie Zhang
  • Xuemin Lin
Abstract

In this paper, we propose a novel type of query on uncertain data, the probabilistic threshold top-k query, and give an exact algorithm for answering it. More details can be found in [4].

I. PROBABILISTIC THRESHOLD TOP-k QUERIES

We consider uncertain data in the possible worlds semantics model [1], [5], [7], which is also adopted by some recent studies on uncertain data processing, such as [8], [2], [6]. Generally, an uncertain table T contains a set of (uncertain) tuples, where each tuple t ∈ T is associated with a membership probability value Pr(t) > 0. When there is no confusion, we also call an uncertain table simply a table.

A generation rule on a table T specifies a set of exclusive tuples in the form $R : t_{r_1} \oplus \cdots \oplus t_{r_m}$, where $t_{r_i} \in T$ ($1 \le i \le m$) and $\sum_{i=1}^{m} \Pr(t_{r_i}) \le 1$. The rule R requires that, among all tuples $t_{r_1}, \ldots, t_{r_m}$ involved in the rule, at most one tuple can appear in a possible world. As in [8], [2], we assume that each tuple is involved in at most one generation rule. For a tuple t not involved in any generation rule, we can make up a trivial rule $R_t : t$. Therefore, conceptually, an uncertain table T comes with a set of generation rules $\mathcal{R}_T$ such that each tuple is involved in one and only one generation rule in $\mathcal{R}_T$.

We write t ∈ R if tuple t is involved in rule R. The probability of a rule is the sum of the membership probability values of all tuples involved in the rule, denoted by $\Pr(R) = \sum_{t \in R} \Pr(t)$. The length of a rule is the number of tuples involved in the rule, denoted by |R| = |{t | t ∈ R}|. A generation rule R is a singleton rule if |R| = 1, and a multi-tuple rule if |R| > 1. A tuple is dependent if it is involved in a multi-tuple rule; otherwise, it is independent.

For a subset of tuples S ⊆ T and a generation rule R, we denote the tuples involved in R and appearing in S by R ∩ S. A possible world W is a subset of T such that for each generation rule $R \in \mathcal{R}_T$, |R ∩ W| = 1 if Pr(R) = 1, and |R ∩ W| ≤ 1 if Pr(R) < 1.
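To make the possible-world definition concrete, the following Python sketch enumerates the possible worlds of a small table. The toy table, the tuple names `t1` to `t4`, and the helpers `rule_probability` and `possible_worlds` are illustrative assumptions and do not come from the paper. The enumeration is exponential in the number of rules and is meant only to illustrate the definition.

```python
from itertools import product

# Hypothetical toy table: tuple name -> membership probability Pr(t).
PR = {"t1": 0.5, "t2": 0.5, "t3": 0.4, "t4": 0.3}

# Hypothetical generation rules; every tuple belongs to exactly one rule.
# t1 (+) t2 is a multi-tuple rule with Pr(R) = 1; t3 and t4 get trivial rules.
RULES = [("t1", "t2"), ("t3",), ("t4",)]

EPS = 1e-9  # tolerance for comparing floating-point probabilities with 1


def rule_probability(rule):
    """Pr(R): sum of the membership probabilities of the tuples in the rule."""
    return sum(PR[t] for t in rule)


def possible_worlds(rules):
    """Yield every possible world as a frozenset of tuple names."""
    per_rule_choices = []
    for rule in rules:
        # Each rule contributes exactly one of its tuples ...
        choices = [(t,) for t in rule]
        # ... or no tuple at all, which is allowed only when Pr(R) < 1.
        if rule_probability(rule) < 1.0 - EPS:
            choices.append(())
        per_rule_choices.append(choices)
    for combo in product(*per_rule_choices):
        yield frozenset(t for choice in combo for t in choice)


worlds = list(possible_worlds(RULES))
for world in worlds:
    print(sorted(world))
```

In this toy table the rule t1 ⊕ t2 has probability 1, so every enumerated world contains exactly one of `t1` and `t2`, while `t3` and `t4` may each be present or absent.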
We denote by $\mathcal{W}$ the set of all possible worlds. Clearly, for an uncertain table T with a set of generation rules $\mathcal{R}_T$, the number of all possible worlds is

$|\mathcal{W}| = \prod_{R \in \mathcal{R}_T,\, \Pr(R) = 1} |R| \cdot \prod_{R \in \mathcal{R}_T,\, \Pr(R) < 1} (|R| + 1)$.

The number of possible worlds on a large table can be huge.

Each possible world W is associated with an existence probability Pr(W), the probability that the possible world occurs. Following basic probability principles, we have

$\Pr(W) = \prod_{R \in \mathcal{R}_T,\, |R \cap W| = 1} \Pr(R \cap W) \cdot \prod_{R \in \mathcal{R}_T,\, R \cap W = \emptyset} (1 - \Pr(R))$.

Clearly, for a possible world W, Pr(W) > 0. Moreover, $\sum_{W \in \mathcal{W}} \Pr(W) = 1$.

A top-k query Q(P, f) contains a predicate P, a ranking function f, and an integer k > 0. When Q is applied on a set of certain tuples, the tuples satisfying predicate P are ranked according to ranking function f, and the top-k tuples are returned. For tuples $t_1$ and $t_2$, we write $t_1 \preceq_f t_2$ if $t_1$ is ranked higher than or equal to $t_2$ according to ranking function f. $\preceq_f$, called the ranking order, is a total order on all tuples.

Since a possible world W is a set of tuples, a top-k query Q can be applied to W directly. We denote by Q(W) the top-k tuples returned by a top-k query Q on a possible world W. Q(W) contains k tuples.

A probabilistic threshold top-k query (PT-k query for short) on an uncertain table T consists of a top-k query Q and a probability threshold p (0 < p ≤ 1). For each possible world W, Q is applied and a set of k tuples Q(W) is returned. For a tuple t ∈ T, the top-k probability of t is the total probability of the possible worlds $W \in \mathcal{W}$ in which t is in Q(W), that is,

$\Pr^k_{Q,T}(t) = \sum_{W \in \mathcal{W},\, t \in Q(W)} \Pr(W)$.

When Q and T are clear from context, we often write $\Pr^k_{Q,T}(t)$ as $\Pr^k(t)$ for simplicity.

The answer set to a PT-k query is the set of all tuples whose top-k probability values are at least p. That is, $Answer(Q, p, T) = \{t \mid t \in T, \Pr^k_{Q,T}(t) \ge p\}$. We are interested in how to compute the answer set for a PT-k query on an uncertain table efficiently.

II. AN EXACT ALGORITHM

Hereafter, by default we consider a top-k query Q(P, f) on an uncertain table T. P(T) = {t | t ∈ T ∧ P(t) = true} is the set of tuples satisfying the query predicate. P(T) is also an uncertain table, where each tuple in P(T) carries the same membership probability as in T. Moreover, a generation rule R in T is projected to P(T) by removing all tuples from R that are not in P(T). Then, the problem of answering the PT-k query is to find the tuples in P(T) whose top-k probability values pass the probability threshold. Clearly, Answer(Q, p, T) = Answer(Q, p, P(T)). We only need to consider P(T) in answering a top-k query.

A. The Dominant Set Property

For a tuple t ∈ P(T) and a possible world W such that t ∈ W, whether t ∈ Q(W) depends only on how many other tuples in P(T) ranked higher than t appear in W. Technically, [...]

[Figure: compressing a generation rule R into a rule-tuple with respect to a tuple t_i. Case 2: t_i is ranked lower than all tuples in R (rule-tuple t_R with Pr(t_R) = Pr(R)). Case 3: t_i is ranked between tuples in R (rule-tuple t_{R_left}).]
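The definitions of existence probability, top-k probability, and the PT-k answer set can be checked on a small example by brute force. The Python sketch below enumerates every possible world, so it is a naive baseline and not the exact algorithm proposed in the paper. The toy table, the scores standing in for the ranking function f, and the function names are assumptions made for illustration; all tuples are assumed to satisfy the predicate P already, and scores are assumed distinct. Membership in the top-k is tested by counting the higher-ranked tuples present in the world, as in the observation that opens the dominant set discussion.

```python
from itertools import product
from math import prod

# Hypothetical table P(T): name -> (score under f, membership probability).
TABLE = {
    "t1": (90, 0.5),
    "t2": (80, 0.5),
    "t3": (70, 0.4),
    "t4": (60, 0.3),
}
# Hypothetical generation rules: t1 (+) t2 with Pr(R) = 1; t3, t4 independent.
RULES = [("t1", "t2"), ("t3",), ("t4",)]


def worlds_with_probability(table, rules):
    """Yield (world, Pr(world)) for every possible world (exponential)."""
    per_rule = []
    for rule in rules:
        pr_rule = sum(table[t][1] for t in rule)
        # Option "tuple t appears" contributes the factor Pr(t).
        options = [((t,), table[t][1]) for t in rule]
        # Option "no tuple of the rule appears" contributes 1 - Pr(R).
        if pr_rule < 1.0 - 1e-9:
            options.append(((), 1.0 - pr_rule))
        per_rule.append(options)
    for combo in product(*per_rule):
        world = frozenset(t for chosen, _ in combo for t in chosen)
        yield world, prod(p for _, p in combo)


def topk_probabilities(table, rules, k):
    """Top-k probability of each tuple, summed over all possible worlds."""
    result = {t: 0.0 for t in table}
    for world, pr_world in worlds_with_probability(table, rules):
        for t in world:
            # t is in the top-k of this world iff fewer than k
            # higher-ranked tuples appear in the same world.
            higher = sum(1 for u in world if table[u][0] > table[t][0])
            if higher < k:
                result[t] += pr_world
    return result


def pt_k_answer(table, rules, k, p):
    """Tuples whose top-k probability is at least the threshold p."""
    probs = topk_probabilities(table, rules, k)
    return {t for t, pr in probs.items() if pr >= p}


print(sorted(pt_k_answer(TABLE, RULES, k=2, p=0.3)))  # prints ['t1', 't2', 't3']
```

On this toy table with k = 2, the top-2 probabilities are 0.5, 0.5, 0.4, and 0.18 for `t1` to `t4` (up to floating-point rounding): `t4` reaches the top 2 only when it is present and `t3` is absent, because exactly one of `t1` and `t2` always appears.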




Publication date: 2007